Abstract

Canonical correlation analysis (CCA) has proven an effective tool for two-view dimension reduction
due to its profound theoretical foundation and success in practical applications. With respect to multi-view
learning, however, it is limited by its capability of only handling data represented by two-view features,
while in many real-world applications the number of views is often larger. Although the
ad hoc way of simultaneously exploring all possible pairs of features can numerically deal with multi-view
data, it ignores the high-order statistics (correlation information) which can only be discovered by
simultaneously exploring all features.

Therefore, in this work, we develop tensor CCA (TCCA), which straightforwardly yet naturally generalizes
CCA to handle the data of an arbitrary number of views by analyzing the covariance tensor
of the different views. TCCA aims to directly maximize the canonical correlation of multiple (more
than two) views. Crucially, we prove that the multi-view canonical correlation maximization problem is
equivalent to finding the best rank-1 approximation of the data covariance tensor, which can be solved
efficiently using the well-known alternating least squares (ALS) algorithm. As a consequence, the high-order
correlation information contained in the different views is explored and thus a more reliable common
subspace shared by all features can be obtained. In addition, a non-linear extension of TCCA is
presented. Experiments on various challenging tasks, including large scale biometric structure prediction,
internet advertisement classification and web image annotation, demonstrate the effectiveness of the
proposed method.


1 Introduction 

The features utilized in many real-world data mining tasks are frequently of high dimension and extracted 
from multiple views (or sources). For example, both the page content and hyperlink represented by bag-of- 
words (BOW) are usually used in web page classification Blum and Mitchell (1998); Foster et al (2008), and 
it is common to combine the global (such as GIST Oliva and Torralba (2001)) and local (such as SIFT Lowe 
(2004)) descriptors in image annotation Chua et al (2009); Guillaumin et al (2009). In these applications, 
the features can have dimensions of up to several hundred or thousand. 

Figure 1: The tensor CCA motivation. Only the pairwise correlation is explored in the traditional extensions
of CCA, while much more information (i.e., the high-order correlation) that can only be obtained by
simultaneously examining all views is explored in the proposed TCCA.


Multi-view dimension reduction Foster et al (2008) seeks a low-dimensional common subspace to compactly
represent the heterogeneous data, in which each of the data examples is associated with multiple
high-dimensional features. It often benefits the subsequent learning process significantly in that the
curse-of-dimensionality is alleviated and the computational efficiency is improved Hou et al (2010); Han et al
(2012). Canonical correlation analysis (CCA), which is designed to inspect the linear relationship between
two sets of variables Hardoon et al (2004); Bach and Jordan (2005), was formally introduced as a multi-view
dimension reduction method in Foster et al (2008), where the authors prove that the labeled instance complexity
can be effectively reduced under certain weak assumptions. In addition, CCA has been widely used
for multi-view classification Farquhar et al (2005), regression Kakade and Foster (2007), clustering Blaschko
and Lampert (2008); Chaudhuri et al (2009), etc. Theoretically, Bach and Jordan (2005)
interpreted CCA probabilistically as a latent variable model, so that it can be embedded in a larger
probabilistic model.

In spite of the profound theoretical foundation and practical success of CCA in multi-view learning, it
can only handle data that is represented by two-view features. The features utilized in many real-world
applications, however, are usually extracted from more than two views. For example, different kinds of
color, texture and shape features are commonly used in visual analysis-based tasks such as image annotation
and video retrieval. A typical approach for generalizing CCA to several views is to maximize the sum of
pairwise correlations between different views Vía et al (2007). The main drawback of this strategy is that
only the statistics (correlation information) between pairs of features are explored, while high-order statistics
that can only be obtained by simultaneously examining all features are ignored.

To tackle this problem, we develop tensor CCA (TCCA) to generalize CCA to handle an arbitrary number
of views in a straightforward and yet natural way. In particular, TCCA aims to directly maximize the
correlation between the canonical variables of all views, and this is achieved by analyzing the high-order
covariance tensor over the data from all views. We prove that maximizing the correlation is equivalent to
approximating the covariance tensor with a rank-1 tensor in an optimal least squares sense. This approximation
has been investigated in the literature and an efficient alternating least squares (ALS) algorithm
can be adopted for optimization Kroonenberg and De Leeuw (1980); De Lathauwer et al (2000b); Comon
et al (2009). With respect to the traditional pairwise correlation maximization, the statistics (correlation
information) explored can be measured using the $m(m-1)/2$ covariance matrices of size $d^2$, where $m$ is
the number of views and $d$ represents the average feature dimension, whereas in the proposed TCCA, the
size of the covariance tensor is $d^m$. Fig. 1 is an illustrative example, where $m = 3$. Much more correlation
information is encoded in the common subspace shared by all features in multi-view dimension reduction,
and thus better performance can be expected. Furthermore, we extend the proposed TCCA to
the non-linear case, which is useful when the feature dimensions are very high and limited instances are
available. We perform extensive experiments on a variety of challenging tasks, including large scale biometric
structure prediction, internet advertisement classification and web image annotation. We compare the proposed
method with the traditional CCA Foster et al (2008) and its multi-view extension Vía et al (2007), as
well as two representative unsupervised multi-view dimension reduction approaches Long et al (2008); Han
et al (2012). The results confirm the effectiveness of the proposed TCCA.

The article is organized as follows. We summarize closely related works in Section 2. A brief introduction
of CCA and its traditional multi-view extension is presented in Section 3. Section 4 includes the description, 
formulation, and analysis of the proposed TCCA, as well as its non-linear extension kernel TCCA (KTCCA) 
for multi-view dimension reduction. Extensive experiments are presented in Section 5 and the paper is 
concluded in Section 6. 


2 Related Work 

2.1 Multi-view Dimension Reduction 

Dimension reduction is a key technique in machine learning. The goal of dimension reduction is to find a
low dimensional representation for high dimensional data Xia et al (2010). Feature selection and feature
transformation are the two main approaches for dimension reduction. The former aims to select a subset
of variables from the original, while the latter transforms the data to a new space of fewer dimensions.
Dimension reduction can be performed in an unsupervised (e.g., principal component analysis (PCA)
and Laplacian eigenmaps (LE) Belkin and Niyogi (2001)), semi-supervised Benabdeslem and Hindawi (2014),
or supervised (e.g., linear discriminant analysis (LDA)) setting; these settings differ in the amount of labeled
information being utilized.

In another research line, multi-view learning has attracted much attention recently. Here, multi-view
refers to multiple feature representations of an object, not the spatial viewpoints considered in some
vision and graphics applications Su et al (2009). We generally classify the multi-view learning algorithms
into three families: weighted view combination Lanckriet et al (2004); McFee and Lanckriet (2011), multi-view
dimension reduction Hardoon et al (2004); White et al (2012), and view agreement exploration Blum
and Mitchell (1998); Kumar et al (2011). Multi-view dimension reduction focuses on removing irrelevant or
redundant information Benabdeslem and Hindawi (2014) and reducing the feature dimension of data that
consists of multiple views by leveraging the dependencies, coherence, and complementarity of those views.
The different views are often assumed to be conditionally independent, thus a latent representation shared
by all views can be obtained by exploiting the conditional independence structure of the multi-view data
Foster et al (2008); Long et al (2008); White et al (2012); Han et al (2012); Chen et al (2012). For example,
canonical correlation analysis (CCA) is employed for multi-view dimension reduction in Foster et al (2008)
to exploit the underlying conditional independence and redundancy assumption in multi-view learning. A
general unsupervised learning method is presented in Long et al (2008) for multi-view data, where a consensus
representation is learned by first applying a dimension reduction technique (such as spectral embedding Belkin
and Niyogi (2001)) on each view and then combining the results via matrix factorization. In Han et al (2012),
structured sparsity Jenatton et al (2011) is enforced among the different views in the learning of the
low-dimensional consensus representation, to allow information to be shared across subsets of features adaptively.
In contrast to unsupervised multi-view dimension reduction, similarity/dissimilarity pairwise constraints
are utilized in Hou et al (2010) for semi-supervised multi-view dimension reduction. In Chen et al (2012),
the supervising information is also incorporated in the learned latent shared subspace by the use of a
large-margin latent Markov network. These methods usually obtain only a locally optimal subspace. Therefore,
White et al (2012) proposed a convex formulation for learning a shared subspace of multiple
sources. In the learned subspace, conditional independence constraints are enforced.

2.2 Canonical Correlation Analysis and Its Extensions 

Canonical correlation analysis (CCA), originally proposed by Hotelling (1936), finds bases for two random
variables (or sets of variables) so that the coordinates of the variable pairs projected on these bases are
maximally correlated Hardoon et al (2004). Much success has been achieved by applying CCA to pattern
recognition and data mining. For example, SVM-2K was proposed in Farquhar et al (2005) for two-view
classification. It combines kernel CCA and support vector machine (SVM) in a single optimization problem,
and the authors prove that the Rademacher complexity of SVM-2K is significantly lower than that of the individual
SVMs. Kakade and Foster (2007) presented a multi-view regression algorithm regularized
with a norm that is derived by applying CCA on unlabeled data. The authors show that the intrinsic
dimension of the regression problem with the induced norm can be characterized by the correlation coefficients
obtained in CCA. Under the conditionally uncorrelated assumption, a simple and efficient subspace learning
algorithm based on CCA was proposed in Chaudhuri et al (2009) for multi-view clustering. The algorithm
was shown to work well under much weaker separation conditions than the previous clustering methods.

In addition to these applications, there have been dozens of developments for CCA, most of which
concentrate on inspecting the relationship between two sets of tensors rather than vectors. For example, the
classical CCA was extended in Lee and Choi (2007) to 2D-CCA, which directly analyzes 2D images without
reshaping them into vectors. Some of its extensions are local 2D-CCA Wang (2010), sparse 2D-CCA Yan
et al (2012), and multilinear CCA (MCCA) Lu (2013). Considering that the two high-order tensors to be
studied may share multiple modes (e.g., the video volume data), Kim and Cipolla (2009)
presented two architectures for tensor correlation maximization by applying canonical transformation on
the non-shared modes. In this way, features that have a good balance between flexibility and descriptive
power may be obtained. This method is also termed “tensor CCA” (TCCA), but is quite different from
the approach proposed in this paper. The main difference is that their method focuses on analyzing two
high-order tensor data sets, while our objective is to analyze the high-order statistics among multiple vector
data sets (views).

The works most closely related to our method, to the best of our knowledge, are the maximum variance
CCA (CCA-MAXVAR) Kettenring (1971) and an adaptive CCA algorithm termed CCA-LS Vía et al (2007),
which is based on least squares (LS) regression. The CCA-MAXVAR algorithm is performed by weighted
combination of the canonical variables (projected vectors) of all views to approximate a latent common
representation. This approach requires costly singular value decomposition (SVD) for optimization and cannot
be trained in an adaptive fashion. To avoid these drawbacks, Vía et al (2007) reformulated
CCA-MAXVAR as a set of coupled LS regression problems, which seeks to minimize the distance between each
pair of canonical variables. The reformulation is proved to be equivalent to the original CCA-MAXVAR
formulation, but is much more efficient and can be learned adaptively. Nevertheless, there is still a disadvantage
to both CCA-LS and CCA-MAXVAR, namely that only the pairwise correlations are exploited, while the
high-order correlations between all views are ignored. We develop the following tensor CCA framework
to rectify this shortcoming.


3 Canonical Correlation Analysis (CCA) and Its Multi-view Generalization


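This section briefly introduces standard canonical correlation analysis (CCA) and its traditional generalizations
to several data sets Kettenring (1971); Vía et al (2007). Given two sets of column vectors
$\mathbf{x}_{1n} \in \mathbb{R}^{d_1}$ and $\mathbf{x}_{2n} \in \mathbb{R}^{d_2}$, $n = 1, \dots, N$, the objective of CCA is to find a pair of
projections (usually called canonical vectors) $\mathbf{h}_1$, $\mathbf{h}_2$, such that the correlation between the two vectors of
canonical variables $\mathbf{z}_1 \in \mathbb{R}^N$ and $\mathbf{z}_2 \in \mathbb{R}^N$, with entries $z_{1n} = \mathbf{x}_{1n}^T \mathbf{h}_1$ and $z_{2n} = \mathbf{x}_{2n}^T \mathbf{h}_2$, is
maximized. The optimization problem is thus given by

$$ \arg\max_{\mathbf{h}_1, \mathbf{h}_2} \; \rho = \mathrm{corr}(\mathbf{z}_1, \mathbf{z}_2) = \frac{\mathbf{h}_1^T C_{12} \mathbf{h}_2}{\sqrt{\mathbf{h}_1^T C_{11} \mathbf{h}_1 \; \mathbf{h}_2^T C_{22} \mathbf{h}_2}}, \tag{3.1} $$

where $C_{11} = X_1 X_1^T$ and $C_{22} = X_2 X_2^T$ are data variance matrices, and $C_{12} = X_1 X_2^T$ is the covariance matrix.
Here, $X_1 \in \mathbb{R}^{d_1 \times N}$ and $X_2 \in \mathbb{R}^{d_2 \times N}$ are the stacked data matrices. The optimization of problem (3.1) leads
to the main solution of CCA, and the remaining solutions are given by maximizing the same correlation
under the constraint of being orthogonal to the previous solutions.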
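
Problem (3.1) can be illustrated with a small numerical sketch. The code below is not taken from the paper: it assumes the common whitening-plus-SVD route to the first canonical pair, adds a tiny ridge `eps` purely for numerical invertibility, and runs on synthetic data; the function name `cca_first_pair` is hypothetical.

```python
import numpy as np

def cca_first_pair(X1, X2, eps=1e-8):
    """First canonical pair for X1 (d1 x N) and X2 (d2 x N), columns are instances."""
    X1 = X1 - X1.mean(axis=1, keepdims=True)      # center each feature
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    C11 = X1 @ X1.T + eps * np.eye(X1.shape[0])   # tiny ridge: an assumption, for invertibility
    C22 = X2 @ X2.T + eps * np.eye(X2.shape[0])
    C12 = X1 @ X2.T

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)                  # C is symmetric positive definite
        return (V * w ** -0.5) @ V.T

    W1, W2 = inv_sqrt(C11), inv_sqrt(C22)
    U, s, Vt = np.linalg.svd(W1 @ C12 @ W2)       # singular values are the canonical correlations
    return W1 @ U[:, 0], W2 @ Vt[0, :], s[0]

# Synthetic two-view data sharing one latent signal (illustrative only).
rng = np.random.default_rng(0)
N = 500
shared = rng.standard_normal(N)
X1 = np.vstack([shared + 0.1 * rng.standard_normal(N), rng.standard_normal(N)])
X2 = np.vstack([rng.standard_normal(N), shared + 0.1 * rng.standard_normal(N),
                rng.standard_normal(N)])
h1, h2, rho = cca_first_pair(X1, X2)
```

The returned `rho` is the value of the objective in (3.1) at the first canonical pair; later pairs would correspond to the remaining singular vectors.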

Figure 2: System diagram of the multi-view dimension reduction method by the use of the proposed TCCA.
Firstly, different kinds of features are extracted to represent the available instances in different views. Then
a covariance tensor is calculated on the obtained representations $X_p$, $p = 1, \dots, m$, to discover the correlation
information between all views. By approximating the covariance tensor with a set of rank-1 tensors, we
obtain the transformation matrix $U_p$ for the $p$'th view. Each $U_p$ maps the original $X_p$ to the low dimensional
$Z_p$ in the common subspace, and the final representation is a concatenation of $Z_p$, $p = 1, \dots, m$.

CCA-MAXVAR Kettenring (1971) generalizes CCA to $m$ views. Suppose the data matrix for the $p$'th
view is $X_p \in \mathbb{R}^{d_p \times N}$; then the optimization problem of CCA-MAXVAR for finding the canonical vectors
$\{\mathbf{h}_p\}_{p=1}^m$ is

$$ \arg\min_{\mathbf{z}, \boldsymbol{\alpha}, \{\mathbf{h}_p\}} \; \frac{1}{m} \sum_{p=1}^{m} \|\mathbf{z} - \alpha_p \mathbf{z}_p\|_2^2, \quad \text{s.t. } \|\mathbf{z}_p\|_2 = 1, \tag{3.2} $$

where $\mathbf{z}_p = X_p^T \mathbf{h}_p$ is the vector of canonical variables, $\mathbf{z}$ is the best possible one-dimensional PCA
representation, and $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_m]^T$ is the vector of combination weights. To avoid a trivial solution, an additional
constraint that fixes the norm $\|\boldsymbol{\alpha}\|_2$ is enforced. The solutions of (3.2) can be obtained using the SVD of $X_p$. To
develop an efficient and adaptive algorithm, Vía et al (2007) reformulated (3.2) as


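$$ \arg\min_{\{\mathbf{h}_p\}} \; \frac{1}{2m(m-1)} \sum_{p,q=1}^{m} \|X_p^T \mathbf{h}_p - X_q^T \mathbf{h}_q\|_2^2, \quad \text{s.t. } \frac{1}{m} \sum_{p=1}^{m} \mathbf{h}_p^T C_{pp} \mathbf{h}_p = 1. \tag{3.3} $$

The orthogonality constraint $\mathbf{z}^{(k)T} \mathbf{z}^{(l)} = 0$, $k \neq l$, is imposed on the different solutions, which can be obtained
by using an iterative algorithm based on LS regression Vía et al (2007). Here, $\mathbf{z}^{(k)} = \frac{1}{m} \sum_{p=1}^{m} \mathbf{z}_p^{(k)}$ and $\mathbf{z}_p^{(k)}$ is
a vector of canonical variables projected using the $k$'th canonical vector in the $p$'th view.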


4 Tensor Canonical Correlation Analysis (TCCA) 

In contrast to CCA-MAXVAR Kettenring (1971) and CCA-LS Vía et al (2007), where only the pairwise
correlations are considered, we propose tensor CCA (TCCA) for multi-view dimension reduction by exploiting
the high-order tensor correlation between all views. The diagram of the multi-view dimension reduction
method using the proposed TCCA is shown in Fig. 2. Different kinds of features, such as LAB color
histogram (LAB), wavelet texture (WT), and the local SIFT features (SIFT), are first extracted to represent
the instances in different views. This leads to multiple feature matrices $\{X_p \in \mathbb{R}^{d_p \times N}\}_{p=1}^m$. Here, $m$ is set to
3 for intuitive illustration without loss of generality. The different sets of features are then used to calculate
the data covariance tensor $\mathcal{C}_{123}$, which is subsequently decomposed as a weighted sum of rank-1 tensors, i.e.,
$\mathcal{C}_{123} \approx \sum_{k=1}^{r} \lambda_k \, \mathbf{u}_1^{(k)} \circ \mathbf{u}_2^{(k)} \circ \mathbf{u}_3^{(k)}$, where $r < \min(d_1, d_2, d_3)$ is the reduced dimension and $\circ$ is the tensor
(outer) product. The vectors $\{\mathbf{u}_p^{(k)}\}_{k=1}^r$ are stacked as a transformation matrix $U_p$, which is used to map
the original high dimensional features into the low dimensional common subspace. The projected features
are concatenated as the final representation of the instances. The details of this technique are given
below, but first we briefly introduce several useful notations and concepts of multilinear algebra.

4.1 Notations 

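Let $\mathcal{A}$ be an $m$-order tensor of size $I_1 \times I_2 \times \dots \times I_m$ and $U$ be a $J_p \times I_p$ matrix. The $p$-mode product of $\mathcal{A}$
and $U$ is then denoted as $\mathcal{B} = \mathcal{A} \times_p U$, which is an $I_1 \times \dots \times I_{p-1} \times J_p \times I_{p+1} \times \dots \times I_m$ tensor with the element

$$ \mathcal{B}(i_1, \dots, i_{p-1}, j_p, i_{p+1}, \dots, i_m) = \sum_{i_p=1}^{I_p} \mathcal{A}(i_1, i_2, \dots, i_m) \, U(j_p, i_p). \tag{4.1} $$

The product of $\mathcal{A}$ and a sequence of matrices $\{U_p \in \mathbb{R}^{J_p \times I_p}\}_{p=1}^m$ is a $J_1 \times J_2 \times \dots \times J_m$ tensor denoted by

$$ \mathcal{B} = \mathcal{A} \times_1 U_1 \times_2 U_2 \dots \times_m U_m. \tag{4.2} $$

The mode-$p$ matricization of $\mathcal{A}$ is denoted as an $I_p \times (I_1 \cdots I_{p-1} I_{p+1} \cdots I_m)$ matrix $A_{(p)}$, which is obtained by
mapping the fibers associated with the $p$'th dimension of $\mathcal{A}$ as the rows of $A_{(p)}$, and aligning the corresponding
fibers of all the other dimensions as the columns. Here, the columns can be ordered in any way. The $p$-mode
multiplication $\mathcal{B} = \mathcal{A} \times_p U$ can be manipulated as matrix multiplication by storing the tensors in matricized
form, i.e., $B_{(p)} = U A_{(p)}$. Specifically, the series of $p$-mode products in (4.2) can be expressed as a series of
Kronecker products and is given by

$$ B_{(p)} = U_p A_{(p)} \left( U_{c_{m-1}} \otimes U_{c_{m-2}} \otimes \dots \otimes U_{c_1} \right)^T, \tag{4.3} $$

where $\{c_1, c_2, \dots, c_{m-1}\} = \{p+1, p+2, \dots, m, 1, 2, \dots, p-1\}$ is a forward cyclic ordering for the indices of
the tensor dimensions that map to the columns of the matrix. Finally, the Frobenius norm of the tensor $\mathcal{A}$ is
given by

$$ \|\mathcal{A}\|_F^2 = \langle \mathcal{A}, \mathcal{A} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_m=1}^{I_m} \mathcal{A}(i_1, i_2, \dots, i_m)^2. \tag{4.4} $$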
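
The $p$-mode product and the matricization identity $B_{(p)} = U A_{(p)}$ can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: modes are 0-based, the helper names `unfold` and `mode_product` are hypothetical, and the columns of the unfolding follow NumPy's row-major order rather than the cyclic order of (4.3), which is allowed because the column order is arbitrary for this identity.

```python
import numpy as np

def unfold(A, p):
    """Mode-p matricization: rows indexed by mode p, columns by all remaining modes."""
    return np.moveaxis(A, p, 0).reshape(A.shape[p], -1)

def mode_product(A, U, p):
    """p-mode product A x_p U for U of shape (J_p, I_p); p is 0-based here."""
    B = np.tensordot(U, A, axes=(1, p))   # contracts I_p; the new J_p axis comes first
    return np.moveaxis(B, 0, p)           # move the J_p axis back to position p

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))        # a third-order tensor
U = rng.standard_normal((2, 4))           # maps mode 1 from size 4 to size 2
B = mode_product(A, U, 1)                 # shape (3, 2, 5)

# Squared Frobenius norm of (4.4), computed directly and via an unfolding.
fro_direct = np.sum(A ** 2)
fro_unfold = np.linalg.norm(unfold(A, 0)) ** 2
```

Comparing `unfold(B, 1)` with `U @ unfold(A, 1)` reproduces the matricized form of the $p$-mode product.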


4.2 Problem Formulation 

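Given $m$ views of $N$ instances, each $X_p = [\mathbf{x}_{p1}, \mathbf{x}_{p2}, \dots, \mathbf{x}_{pN}] \in \mathbb{R}^{d_p \times N}$, $p = 1, \dots, m$, is assumed to have been
centered (i.e., to have zero mean). The variance matrices are then

$$ C_{pp} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{pn} \mathbf{x}_{pn}^T, \quad p = 1, \dots, m, $$

and the covariance tensor among all views is calculated as

$$ \mathcal{C}_{12 \dots m} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{1n} \circ \mathbf{x}_{2n} \circ \dots \circ \mathbf{x}_{mn}, $$

where $\mathcal{C}_{12 \dots m}$ is a tensor of dimension $d_1 \times d_2 \times \dots \times d_m$. Following the objective of the traditional two-view
CCA Hardoon et al (2004), the proposed tensor CCA seeks to maximize the correlation between the canonical
variables $\mathbf{z}_p = X_p^T \mathbf{h}_p$, $p = 1, \dots, m$, where $\{\mathbf{h}_p \in \mathbb{R}^{d_p}\}_{p=1}^m$ are usually called the canonical vectors.
Therefore, the optimization problem is

$$ \arg\max_{\{\mathbf{h}_p\}} \; \rho = \mathrm{corr}(\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_m), \quad \text{s.t. } \mathbf{z}_p^T \mathbf{z}_p = 1, \; p = 1, \dots, m. \tag{4.5} $$

Here $\mathrm{corr}(\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_m) = (\mathbf{z}_1 \odot \mathbf{z}_2 \odot \dots \odot \mathbf{z}_m)^T \mathbf{e}$ is the canonical correlation, $\odot$ is the element-wise
product, and $\mathbf{e} \in \mathbb{R}^N$ is an all-ones vector. We can prove that it is equivalent to $\mathcal{C}_{12 \dots m} \times_1 \mathbf{h}_1^T \times_2 \mathbf{h}_2^T \dots \times_m \mathbf{h}_m^T$,
where $\times_p$ is the $p$-mode tensor-matrix product.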
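
Theorem 1. The high-order canonical correlation is given by

$$ \rho = (\mathbf{z}_1 \odot \mathbf{z}_2 \odot \dots \odot \mathbf{z}_m)^T \mathbf{e} = \mathcal{C}_{12 \dots m} \times_1 \mathbf{h}_1^T \times_2 \mathbf{h}_2^T \dots \times_m \mathbf{h}_m^T. \tag{4.6} $$

The proof is presented in the Appendix. By further considering that $X_p X_p^T = C_{pp}$, $p = 1, \dots, m$ (up to the
constant factor $1/N$, which only rescales the solution), the problem (4.5) becomes

$$ \arg\max_{\{\mathbf{h}_p\}} \; \rho = \mathcal{C}_{12 \dots m} \times_1 \mathbf{h}_1^T \times_2 \mathbf{h}_2^T \dots \times_m \mathbf{h}_m^T, \quad \text{s.t. } \mathbf{h}_p^T C_{pp} \mathbf{h}_p = 1, \; p = 1, \dots, m. \tag{4.7} $$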
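
As a sanity check of Theorem 1, the sketch below builds the covariance tensor for three synthetic views and compares both sides of the identity. It is illustrative only: the data are random, and because the covariance tensor is defined with the factor $1/N$, the tensor side is multiplied by $N$ before the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims = 50, (4, 3, 5)                          # three views with d_1, d_2, d_3 features
X = [rng.standard_normal((d, N)) for d in dims]
X = [Xp - Xp.mean(axis=1, keepdims=True) for Xp in X]   # center each view

# Covariance tensor C_123 = (1/N) * sum_n x_1n o x_2n o x_3n (outer products).
C = np.einsum('in,jn,kn->ijk', *X) / N

h = [rng.standard_normal(d) for d in dims]       # arbitrary canonical vectors
z = [Xp.T @ hp for Xp, hp in zip(X, h)]          # canonical variables z_p = X_p^T h_p

lhs = np.sum(z[0] * z[1] * z[2])                 # (z_1 . z_2 . z_3)^T e, element-wise product
rhs = N * np.einsum('ijk,i,j,k->', C, *h)        # N * (C x_1 h_1^T x_2 h_2^T x_3 h_3^T)
```

The two quantities `lhs` and `rhs` agree to numerical precision for any choice of the vectors `h`.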

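We further add a regularization term in the constraints to control the model complexity, and thus the
constraints of problem (4.7) become

$$ \mathbf{h}_p^T (C_{pp} + \epsilon I) \mathbf{h}_p = 1, \quad p = 1, \dots, m, \tag{4.8} $$

where $I$ is an identity matrix and $\epsilon$ is a nonnegative trade-off parameter. Let each $\mathbf{u}_p = \tilde{C}_{pp}^{1/2} \mathbf{h}_p$ and
$\mathcal{M} = \mathcal{C}_{12 \dots m} \times_1 \tilde{C}_{11}^{-1/2} \times_2 \tilde{C}_{22}^{-1/2} \dots \times_m \tilde{C}_{mm}^{-1/2}$; we can then reformulate (4.7) as

$$ \arg\max_{\{\mathbf{u}_p\}} \; \rho = \mathcal{M} \times_1 \mathbf{u}_1^T \times_2 \mathbf{u}_2^T \dots \times_m \mathbf{u}_m^T, \quad \text{s.t. } \mathbf{u}_p^T \mathbf{u}_p = 1, \; p = 1, \dots, m, \tag{4.9} $$

where $\tilde{C}_{pp} = C_{pp} + \epsilon I$. The equivalence of the problems (4.7) and (4.9) is ensured by the following theorem.

Theorem 2. The problems (4.7) and (4.9) are equivalent.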

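Proof. It is straightforward that the constraints of problems (4.7) and (4.9) are equivalent, and now we prove
that the objectives of the two problems are the same as follows,

$$
\begin{aligned}
& \mathcal{C}_{12 \dots m} \times_1 \mathbf{h}_1^T \times_2 \mathbf{h}_2^T \dots \times_m \mathbf{h}_m^T \\
&= \mathbf{h}_m^T C_{(m)} (\mathbf{h}_{m-1} \otimes \dots \otimes \mathbf{h}_2 \otimes \mathbf{h}_1) \\
&= \mathbf{u}_m^T \tilde{C}_{mm}^{-1/2} C_{(m)} \big( (\tilde{C}_{m-1,m-1}^{-1/2} \mathbf{u}_{m-1}) \otimes \dots \otimes (\tilde{C}_{11}^{-1/2} \mathbf{u}_1) \big) \\
&= \mathbf{u}_m^T \big( \tilde{C}_{mm}^{-1/2} C_{(m)} (\tilde{C}_{m-1,m-1}^{-1/2} \otimes \dots \otimes \tilde{C}_{11}^{-1/2}) \big) (\mathbf{u}_{m-1} \otimes \dots \otimes \mathbf{u}_1) \\
&= \mathbf{u}_m^T M_{(m)} (\mathbf{u}_{m-1} \otimes \dots \otimes \mathbf{u}_2 \otimes \mathbf{u}_1) \\
&= \mathcal{M} \times_1 \mathbf{u}_1^T \times_2 \mathbf{u}_2^T \dots \times_m \mathbf{u}_m^T,
\end{aligned}
$$

where the matricizing property of the tensor-matrix product presented in (4.3) and some basic properties of
the Kronecker product are applied. □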
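
The change of variables behind Theorem 2 can also be verified numerically. The sketch below is an illustration on random three-view data, not the authors' code: `sym_power` is a hypothetical helper that computes symmetric matrix powers by eigendecomposition, and `eps` is an arbitrary regularization value.

```python
import numpy as np

def sym_power(C, power):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * w ** power) @ V.T

rng = np.random.default_rng(1)
N, dims, eps = 200, (4, 3, 5), 1e-2
X = [rng.standard_normal((d, N)) for d in dims]
X = [Xp - Xp.mean(axis=1, keepdims=True) for Xp in X]

C = np.einsum('in,jn,kn->ijk', *X) / N                           # covariance tensor
Ct = [Xp @ Xp.T / N + eps * np.eye(d) for Xp, d in zip(X, dims)]  # C_pp + eps * I

# M = C x_1 Ct_11^{-1/2} x_2 Ct_22^{-1/2} x_3 Ct_33^{-1/2}
W = [sym_power(Cp, -0.5) for Cp in Ct]
M = np.einsum('ijk,ai,bj,ck->abc', C, *W)

h = [rng.standard_normal(d) for d in dims]                       # arbitrary h_p
u = [sym_power(Cp, 0.5) @ hp for Cp, hp in zip(Ct, h)]           # u_p = Ct_pp^{1/2} h_p

rho_h = np.einsum('ijk,i,j,k->', C, *h)     # objective of (4.7)
rho_u = np.einsum('abc,a,b,c->', M, *u)     # objective of (4.9)
```

Both objectives coincide, and each constraint value $\mathbf{u}_p^T \mathbf{u}_p$ equals $\mathbf{h}_p^T (C_{pp} + \epsilon I) \mathbf{h}_p$, which is the content of the equivalence.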

4.3 Solutions 

It has been shown in De Lathauwer et al (2000b) that the problem (4.9) is equivalent to finding the best
rank-1 approximation of the tensor $\mathcal{M}$, i.e., if we define $\hat{\mathcal{M}} = \rho \, \mathbf{u}_1 \circ \mathbf{u}_2 \circ \dots \circ \mathbf{u}_m$, then the optimization
problem becomes

$$ \arg\min_{\{\mathbf{u}_p\}} \; \|\mathcal{M} - \hat{\mathcal{M}}\|_F^2. \tag{4.10} $$

The solution can be obtained using the alternating least squares (ALS) algorithm Kroonenberg and De Leeuw
(1980); Comon et al (2009). Some other algorithms, such as the high-order power method (HOPM) De Lathauwer
et al (2000b) and the tensor power method Allen (2012), can also be applied here for optimization,
but our empirical findings indicate that the ALS algorithm performs the best in our experiments.
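
To make problem (4.10) concrete, here is a minimal sketch of alternating updates for the rank-1 case of a three-way tensor. It is an assumption-laden illustration rather than the implementation used in the experiments: with the other vectors fixed, the least squares optimum for one vector is the tensor contracted with the others, so for a single rank-1 term the alternating scheme takes the form of a power-type iteration. The function name `rank1_als` and the stopping rule are hypothetical.

```python
import numpy as np

def rank1_als(M, n_iter=100, tol=1e-10, seed=0):
    """Rank-1 approximation rho * u1 o u2 o u3 of a 3-way tensor by alternating updates."""
    rng = np.random.default_rng(seed)
    u = [rng.standard_normal(d) for d in M.shape]
    u = [v / np.linalg.norm(v) for v in u]
    rho = 0.0
    for _ in range(n_iter):
        # Each update solves the least squares problem in one vector, then renormalizes.
        u[0] = np.einsum('ijk,j,k->i', M, u[1], u[2])
        u[0] /= np.linalg.norm(u[0])
        u[1] = np.einsum('ijk,i,k->j', M, u[0], u[2])
        u[1] /= np.linalg.norm(u[1])
        u[2] = np.einsum('ijk,i,j->k', M, u[0], u[1])
        u[2] /= np.linalg.norm(u[2])
        rho_new = np.einsum('ijk,i,j,k->', M, *u)   # current value of the objective in (4.9)
        if abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return rho, u

# Toy check on an exactly rank-1 tensor with weight 3.
rng = np.random.default_rng(42)
a, b, c = (v / np.linalg.norm(v) for v in (rng.standard_normal(4),
                                            rng.standard_normal(3),
                                            rng.standard_normal(5)))
T = 3.0 * np.einsum('i,j,k->ijk', a, b, c)
rho, u = rank1_als(T)
```

On this toy tensor the iteration recovers the weight and the three directions up to sign; on a general tensor it converges to a stationary point that may depend on the initialization.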

As in the two-view CCA, we perform a recursive maximization of the correlation between linear combinations
of $X_p$, $p = 1, \dots, m$. However, we cannot expect the different linear combinations $\mathbf{h}_p^{(1)}, \dots, \mathbf{h}_p^{(r)}$ of $X_p$
to be uncorrelated with each other, where $r$ is the rank of $\mathcal{M}$ (the determination of the rank value is still an
open problem for high-order tensors De Lathauwer et al (2000a)). That is, the orthogonality constraints
cannot be imposed on $\mathbf{u}_p^{(1)}, \dots, \mathbf{u}_p^{(r)}$, since the sum of rank-1 decomposition and orthogonal decomposition
of high-order tensors cannot be satisfied simultaneously De Lathauwer et al (2000a).

Based on the solutions $\mathbf{u}_p$, we obtain the canonical variables $\mathbf{z}_p = X_p^T \mathbf{h}_p = X_p^T \tilde{C}_{pp}^{-1/2} \mathbf{u}_p$. Let $U_p =
[\mathbf{u}_p^{(1)}, \dots, \mathbf{u}_p^{(r)}]$ and let $\mathbf{z}_p^{(1)}, \dots, \mathbf{z}_p^{(r)}$ be the column vectors of $Z_p$; we then obtain the projected data for the $p$'th view:

$$ Z_p = X_p^T \tilde{C}_{pp}^{-1/2} U_p. \tag{4.11} $$

Following Foster et al (2008), where it is suggested that the dimension be reduced to $2r$ in the standard
CCA, we concatenate the different $\{Z_p\}$ as the final representation $Z \in \mathbb{R}^{N \times mr}$ for the subsequent learning,
such as classification Farquhar et al (2005); Fisch et al (2014), clustering Yang et al (2014); Wu et al (2015),
regression Kakade and Foster (2007), search ranking Xu et al (2015); Zhu et al (2015), collaborative filtering
Liu et al (2014), and so on.
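
The whole linear pipeline (covariance tensor, whitening, rank-$r$ decomposition, projection, concatenation) can be sketched for three views as follows. This is one possible reading of the procedure and not the authors' code: it assumes a joint rank-$r$ CP fit by a basic ALS loop, a fixed iteration count, random initialization and synthetic data; the names `cp_als` and `tcca_3view` are hypothetical.

```python
import numpy as np

def inv_sqrt(C):
    w, V = np.linalg.eigh(C)
    return (V * w ** -0.5) @ V.T

def cp_als(M, r, n_iter=200, seed=0):
    """Rank-r CP approximation of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    F = [rng.standard_normal((d, r)) for d in M.shape]
    specs = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']
    lam = np.ones(r)
    for _ in range(n_iter):
        for p in range(3):
            others = [F[q] for q in range(3) if q != p]
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            # Least squares update of factor p with the other two factors fixed.
            F[p] = np.einsum(specs[p], M, *others) @ np.linalg.pinv(gram)
            lam = np.linalg.norm(F[p], axis=0)   # weights of the rank-1 terms
            F[p] = F[p] / lam                    # keep unit-norm columns
    return lam, F

def tcca_3view(X, r, eps=1e-2):
    """X: list of three (d_p x N) matrices. Returns the N x 3r representation."""
    N = X[0].shape[1]
    X = [Xp - Xp.mean(axis=1, keepdims=True) for Xp in X]
    C = np.einsum('in,jn,kn->ijk', *X) / N                        # covariance tensor
    W = [inv_sqrt(Xp @ Xp.T / N + eps * np.eye(Xp.shape[0])) for Xp in X]
    M = np.einsum('ijk,ai,bj,ck->abc', C, *W)                     # whitened tensor
    lam, U = cp_als(M, r)
    Z = [Xp.T @ Wp @ Up for Xp, Wp, Up in zip(X, W, U)]           # projection as in (4.11)
    return np.hstack(Z), M, lam, U

rng = np.random.default_rng(0)
X = [rng.standard_normal((d, 300)) for d in (6, 5, 4)]
Z, M, lam, U = tcca_3view(X, r=2)
```

With $m = 3$ views and $r = 2$, the concatenated representation `Z` has $mr = 6$ columns per instance.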

4.4 Non-linear Extension 

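The projections $\{\mathbf{h}_p\}$ are linear in TCCA and thus may not be appropriate for instances that lie in a highly
non-linear feature space. To this end, we develop kernel tensor CCA (KTCCA), which extends the proposed
TCCA to the non-linear case. KTCCA aims to find non-linear projections by first projecting the data into a
higher dimensional space induced by the feature mapping $\phi$:

$$ \phi(X_p) = [\phi(\mathbf{x}_{p1}), \phi(\mathbf{x}_{p2}), \dots, \phi(\mathbf{x}_{pN})] \in \mathbb{R}^{D_p \times N}, $$

where the mapped dimension $D_p$ may be infinite. Then the variance matrices are

$$ C_{pp} = \frac{1}{N} \sum_{n=1}^{N} \phi(\mathbf{x}_{pn}) \phi^T(\mathbf{x}_{pn}), \quad p = 1, \dots, m, $$

the covariance tensor is

$$ \mathcal{C}_{12 \dots m} = \frac{1}{N} \sum_{n=1}^{N} \phi(\mathbf{x}_{1n}) \circ \phi(\mathbf{x}_{2n}) \circ \dots \circ \phi(\mathbf{x}_{mn}), $$

and the canonical variables are $\mathbf{z}_p = \phi^T(X_p) \mathbf{h}_p$. It follows from the Representer Theorem Schölkopf and Smola
(2002) that $\mathbf{h}_p$ can be rewritten as a linear combination of the given instances, i.e.,

$$ \mathbf{h}_p = \phi(X_p) \mathbf{a}_p, \tag{4.12} $$

where $\mathbf{a}_p \in \mathbb{R}^N$ is a vector of the combination coefficients. The problem (4.7) then becomes

$$ \arg\max_{\{\mathbf{a}_p\}} \; \rho = \mathcal{K}_{12 \dots m} \times_1 \mathbf{a}_1^T \times_2 \mathbf{a}_2^T \dots \times_m \mathbf{a}_m^T, \quad \text{s.t. } \mathbf{a}_p^T K_{pp}^2 \mathbf{a}_p = 1, \; p = 1, \dots, m, \tag{4.13} $$

where $K_{pp} = \phi^T(X_p) \phi(X_p)$ is the kernel matrix of the $p$'th view. The derivation is similar to Theorem 2.
Here $\mathcal{K}_{12 \dots m} = \mathcal{C}_{12 \dots m} \times_1 \phi^T(X_1) \times_2 \phi^T(X_2) \dots \times_m \phi^T(X_m)$, which can be calculated according to the following
theorem.

Theorem 3. The following equality holds:

$$ \mathcal{C}_{12 \dots m} \times_1 \phi^T(X_1) \times_2 \phi^T(X_2) \dots \times_m \phi^T(X_m) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{k}_{1n} \circ \mathbf{k}_{2n} \circ \dots \circ \mathbf{k}_{mn}, $$

in which $\mathbf{k}_{pn} = \phi^T(X_p) \phi(\mathbf{x}_{pn})$, i.e., the $n$'th column of the kernel matrix $K_{pp}$, $p = 1, \dots, m$.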

We give the proof in the Appendix. To avoid trivial learning, we follow Hardoon et al (2004) and introduce
a partial least squares (PLS) term to penalize the norms of the weight vectors $\{\mathbf{a}_p\}$. That is, the constraints
of problem (4.13) become

$$ \mathbf{a}_p^T (K_{pp}^2 + \epsilon K_{pp}) \mathbf{a}_p = 1, \quad p = 1, \dots, m. \tag{4.14} $$

Because the matrix $(K_{pp}^2 + \epsilon K_{pp})$ is positive definite, it has a unique Cholesky decomposition, and we
can denote its decomposition as $(K_{pp}^2 + \epsilon K_{pp}) = L_p^T L_p$. Let $\mathbf{b}_p = L_p \mathbf{a}_p$ and
$\mathcal{S} = \mathcal{K}_{12 \dots m} \times_1 (L_1^{-1})^T \times_2 (L_2^{-1})^T \dots \times_m (L_m^{-1})^T$; we can then reformulate (4.13) as

$$ \arg\max_{\{\mathbf{b}_p\}} \; \rho = \mathcal{S} \times_1 \mathbf{b}_1^T \times_2 \mathbf{b}_2^T \dots \times_m \mathbf{b}_m^T, \quad \text{s.t. } \mathbf{b}_p^T \mathbf{b}_p = 1, \; p = 1, \dots, m. \tag{4.15} $$

Similar to TCCA, this problem is equivalent to finding the best rank-1 approximation of $\mathcal{S}$, and the solution
can be found using the ALS algorithm. By recursively maximizing the correlation, we obtain $\mathbf{b}_p^{(1)}, \dots, \mathbf{b}_p^{(r)}$.
Let $B_p = [\mathbf{b}_p^{(1)}, \dots, \mathbf{b}_p^{(r)}]$. The canonical variables are $\mathbf{z}_p = \phi^T(X_p) \mathbf{h}_p = \phi^T(X_p) \phi(X_p) \mathbf{a}_p = K_{pp} L_p^{-1} \mathbf{b}_p$, and
the projected data for the $p$'th view are then

$$ Z_p = K_{pp} L_p^{-1} B_p. \tag{4.16} $$

The concatenated $Z \in \mathbb{R}^{N \times mr}$ is the final representation of the instances.
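
Theorem 3 states that the kernel tensor can be assembled from columns of the kernel matrices without forming the feature-space covariance tensor. The sketch below checks this numerically under a simplifying assumption: an explicit finite-dimensional feature map is used (the columns of `Phi[p]` stand in for $\phi(\mathbf{x}_{pn})$, i.e., a linear kernel), so that both sides can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims = 6, (4, 3, 5)
# Columns of Phi[p] play the role of phi(x_pn); an explicit map is an assumption
# made only so that the feature-space covariance tensor can be formed.
Phi = [rng.standard_normal((d, N)) for d in dims]
K = [P.T @ P for P in Phi]                          # kernel matrices K_pp (N x N)

# Left-hand side: covariance tensor in feature space, then mode products with phi^T(X_p).
C = np.einsum('in,jn,kn->ijk', *Phi) / N
lhs = np.einsum('ijk,ai,bj,ck->abc', C, *[P.T for P in Phi])

# Right-hand side: (1/N) * sum_n k_1n o k_2n o k_3n, using the n'th kernel columns only.
rhs = np.einsum('an,bn,cn->abc', *K) / N
```

Only the right-hand side is available when the feature map is implicit, which is why the theorem matters in practice; the resulting tensor has $N^m$ entries.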

4.5 Complexity Analysis 

The time and space complexities of the proposed TCCA model are both closely related to the size of the tensor $\mathcal{M}$.
Straightforwardly, the space complexity is $O(d_1 d_2 \dots d_m)$. Because the tensor $\mathcal{M}$ can be calculated offline,
the time complexity is dominated by the rank-$r$ decomposition using the ALS algorithm. Considering that
it is common that $r \ll \min(d_1, d_2, \dots, d_m)$, we can speculate that the time complexity of ALS is $O(t r d_1 d_2 \dots d_m)$
according to Comon et al (2009), where the time cost of the ALS algorithm for three-mode tensors is
presented. Here, $t$ is the number of iterations in ALS.

According to the above analysis, the complexity of TCCA is independent of the number of
instances, and thus our method scales to problems with very large sample sizes. Similarly, the complexities
of KTCCA are determined by the tensor $\mathcal{S}$, the size of which is $N^m$. The space and time complexities are
$O(N^m)$ and $O(t r N^m)$ respectively. This means that KTCCA is suited to problems that
have very high feature dimensions and a small number of instances.
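
A quick back-of-the-envelope calculation makes the contrast between the two storage costs tangible. The sizes below are illustrative assumptions: three views of 105 dimensions each for the linear case, a hypothetical $N = 1000$ instances for the kernel case, and 8 bytes per entry (double precision).

```python
from math import prod

def tensor_entries(dims):
    """Number of entries of a dense tensor with the given mode sizes."""
    return prod(dims)

def dense_bytes(dims, bytes_per_entry=8):
    """Memory of a dense tensor, assuming 8-byte (double precision) entries."""
    return tensor_entries(dims) * bytes_per_entry

linear = dense_bytes((105, 105, 105))      # TCCA: d_1 * d_2 * d_3 entries, independent of N
kernel = dense_bytes((1000, 1000, 1000))   # KTCCA: N^m entries with N = 1000, m = 3

print(f"TCCA tensor:  {linear / 1e6:.3f} MB")
print(f"KTCCA tensor: {kernel / 1e9:.1f} GB")
```

Under these assumptions the linear tensor needs about 9.3 MB regardless of the sample size, while the kernel tensor already needs 8 GB for a thousand instances.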

5 Experiments 

In this section, we empirically validate the effectiveness of the proposed TCCA on a biometric structure 
prediction and an advertisement classification problem following Foster et al (2008), as well as on a challenging 
web image annotation task Chua et al (2009). In all of the following experiments, five random choices of 
the labeled instances are used. Twenty percent of the test data (or unlabeled data in the transductive 
setting) are used for validation, which means that the parameters (if not specified) corresponding to the best 
performance on the validation set are used for testing. The evaluation criterion is the classification accuracy. 

5.1 Evaluation of the Linear Formulation 

In the first two sets of experiments (biometric structure prediction and advertisement classification), we use
regularized least squares (RLS) as the base learner following Foster et al (2008). Given $N_l$ labeled instances
$\{(\mathbf{x}_n, y_n)\}_{n=1}^{N_l}$, the optimization problem for RLS is given by $\arg\min_{\mathbf{w}} \sum_{n=1}^{N_l} (\mathbf{w}^T \mathbf{x}_n - y_n)^2 + \gamma \|\mathbf{w}\|_2^2$, where
the positive trade-off parameter $\gamma$ is set according to Foster et al (2008). A constant feature of 1
is appended to each instance to include a bias term in $\mathbf{w}$. In web image annotation, the $k$-nearest-neighbor
($k$NN) classifier is utilized, where the candidate set for $k$ is $\{1, 2, \dots, 10\}$. Specifically, we compare the
following methods:
• BSF: using the single view feature that achieves the best performance in RLS/kNN-based classification.

• CAT: concatenating the normalized features of all the views into a long vector, and then performing
RLS/kNN-based classification.

• CCA Foster et al (2008): using the CCA formulation presented in Foster et al (2008) to find a
common representation of two different views. In this formulation, a regularization term $\epsilon I$ is added
to control the model complexity, and we set the parameter $\epsilon$ in biometric structure prediction
and advertisement classification according to Foster et al (2008). The parameter is tuned over the set
$\{10^i \mid i = -5, \dots, 4\}$ in web image annotation. The implementation details can be found in Foster et al
(2008). For $m$ different views, there are $m(m-1)/2$ subsets of two views. The subset that achieves
the best performance is termed CCA (BST). To combine the results of all subsets, we average their
predicted scores in RLS-based classification and adopt the majority voting strategy in kNN. This
combination approach is termed CCA (AVG).

• CCA-LS Vía et al (2007): a generalization of CCA to multiple views based on least squares (LS)
regression.

• DSE Long et al (2008): a general and popular unsupervised multi-view dimension reduction method 
based on spectral embedding. 

• SSMVD Han et al (2012): a recently proposed unsupervised multi-view dimension reduction 
method based on the structured sparsity-inducing norm Jenatton et al (2011). 

• TCCA: the proposed tensor CCA. The regularization parameter $\epsilon$ is optimized in the same way as in CCA.

In the first step of DSE and SSMVD, PCA is taken as the dimension reduction method for each view, and
the resulting dimension (of each view) is empirically set to 100.
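
The RLS base learner described at the start of this subsection has a closed-form solution, sketched below. This is an illustrative reading, not the experimental code: the value of `gamma` and the toy data are placeholders, the helper names are hypothetical, and, as in the description above, the bias is handled by appending a constant feature (so it is regularized together with the other weights).

```python
import numpy as np

def rls_fit(X, y, gamma):
    """Regularized least squares. X: (n, d) with instances as rows; returns w of length d + 1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])      # append a constant feature of 1 (bias)
    A = Xb.T @ Xb + gamma * np.eye(Xb.shape[1])        # normal equations with ridge term
    return np.linalg.solve(A, Xb.T @ y)

def rls_predict(X, w):
    """Binary prediction in {-1, +1} from the sign of the regression score."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xb @ w)

# Toy one-dimensional problem with labels in {-1, +1} (placeholder data and gamma).
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = rls_fit(X, y, gamma=1e-3)
pred = rls_predict(X, w)
```

In the experiments, `X` would hold the low-dimensional representation produced by one of the compared methods rather than raw features.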

5.1.1 Biometric Structure Prediction 

The dataset used in this set of experiments is SecStr¹, which is a benchmark dataset for evaluating
semi-supervised systems Chapelle et al (2006). The task associated with this dataset is “to predict the secondary
structure of a given amino acid in a protein based on a sequence window centered around that amino acid”
Chapelle et al (2006). The SecStr dataset is large-scale and contains 84K instances. We randomly select
100 instances as labeled samples. There are also 1200K unlabeled instances, which we use to observe the
performance of three CCA-based methods (CCA, CCA-LS and TCCA) with respect to different amounts of
unlabeled data. Following Foster et al (2008), all the provided data are used (as unlabeled instances) to find
the common subspace in the CCA-based methods. The performance is evaluated in a transductive setting
on the unlabeled samples (except those for validation) of the 84K instances. Both DSE and SSMVD are
naturally transductive, since they learn the low-dimensional representation of given data directly, and no
projection matrix is learned for new data. Therefore, these two methods cannot handle very large datasets
and the experiments are conducted only on the 84K instances. In particular, DSE needs to solve an
eigen-decomposition problem of size $N \times N$. The time or memory cost is intolerable when $N$ is 84K, and
thus a subset of 10K samples is utilized.

The features provided are 15 categorical attributes, each of which is generated at a position in [−7, +7] 
from the sequence window of amino acids, and represented by a 21-dimensional sparse binary vector. We 
divided the 315 (15 × 21) features into three views: 

• View-1: attributes based on the left context (positions in [−7, −3]); 

• View-2: attributes based on the current position and middle context (positions in [−2, 2]); 

• View-3: attributes based on the right context (positions in [3, 7]). 

The dimension of each view is 105. 
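
The view split above amounts to simple index arithmetic. The sketch below assumes a position-major column layout (columns 0 to 20 encode position −7, columns 21 to 41 encode position −6, and so on); this ordering and the helper name `view_columns` are assumptions made for illustration, not details given by the dataset description.

```python
WINDOW = list(range(-7, 8))  # the 15 positions in [-7, +7]
ONE_HOT = 21                 # each categorical attribute is a 21-dim binary vector


def view_columns(lo, hi):
    """Column indices of the attributes at positions lo..hi (inclusive).

    Assumes position-major ordering of the 315 columns (hypothetical layout).
    """
    cols = []
    for pos in range(lo, hi + 1):
        start = WINDOW.index(pos) * ONE_HOT
        cols.extend(range(start, start + ONE_HOT))
    return cols


views = {
    "View-1": view_columns(-7, -3),  # left context
    "View-2": view_columns(-2, 2),   # current position and middle context
    "View-3": view_columns(3, 7),    # right context
}
for name, cols in views.items():
    print(name, len(cols), cols[0], cols[-1])
```

Each view covers 5 positions × 21 dimensions, which gives the 105 dimensions stated above.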

^http://www.kyb.tuebingen.mpg.de/ssl-book 

Figure 3: Prediction accuracy vs. dimension of the common subspace on the SecStr dataset. (Top: 100 
labeled instances and 84K unlabeled instances; Bottom: 100 labeled instances and the entire unlabeled set 
(about 1.3M instances).) 


Table 1: Prediction accuracies (%) of the different methods at their best dimensions on the SecStr dataset 
(100 labeled instances). 

Methods       #unlabeled = 84K    #unlabeled = 1.3M
BSF           57.48±1.90
CAT           57.77±2.03
CCA (BST)     58.78±2.97          59.97±2.46
CCA (AVG)     60.75±1.92          61.15±1.73
CCA-LS        60.23±1.70          61.32±1.65
DSE           60.15±0.81          No Attempt
SSMVD         61.08±1.58          No Attempt
TCCA          62.36±1.27          64.42±1.70


The performance of the compared methods in relation to the dimension of the common subspace is shown 
in Fig. 3. Accuracy is averaged over 5 runs for each dimension r in {5, 10, ..., 100, 110, ..., 200, 220, ..., 300}. 
The performance of the different methods at their best dimensions is summarized in Table 1. From the 
results, we observe that: 1) the concatenation strategy (CAT) is comparable to, and slightly better than, 
the strategy of only using the best single view features (BSF); 2) by learning the common subspace, all the 
compared multi-view dimension reduction methods are significantly better than the BSF and CAT baselines, 
if the dimensionalities are properly set according to the accuracy on the validation dataset. In particular, 
CCA (BST) is superior to CAT, although only a subset of two views is utilized in the former; 3) the accuracy 
of all three CCA-based methods increases as the amount of unlabeled data increases. By combining the 
results of different subsets, CCA (AVG) is better than CCA (BST); 4) CCA-LS is superior to CCA (BST), 
but their performance at their best dimension is comparable. When the number of unlabeled data is 84K, 
DSE and SSMVD are comparable to CCA (BST) and CCA-LS respectively; 5) the performance of TCCA 
does not decrease as significantly as that of CCA-LS and CCA when the number of dimensions is high. The main 
reason is that the ALS algorithm used in TCCA seeks to maximize the canonical correlations for all the r 
factors simultaneously, rather than greedily finding orthogonal decomposition components Allen (2012). That is, 
the main variance tends to be explained uniformly by all factors, not only by the first several factors. This is 
also the reason why there are some oscillations in TCCA; 6) the proposed TCCA significantly outperforms all 
the other methods on most dimensionalities. This demonstrates that the high order correlation information 
between all features is well discovered, and that exploring this kind of information is much better than only 
exploring the correlation information between pairs of features, as in CCA-LS. 

Figure 4: Classification accuracy vs. dimension of the common subspace on the Ads dataset. 100 labeled 
training samples are utilized. 
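
To make the role of ALS more concrete, the following sketch computes a rank-1 approximation of a three-view covariance tensor by alternating updates. It is a simplified illustration, not the implementation used in the experiments: it handles only a single factor (r = 1) and three views, omits the regularized normalization of each view, and the function names are hypothetical.

```python
import numpy as np


def covariance_tensor(X1, X2, X3):
    """Sum over samples of the outer products x_1n o x_2n o x_3n; X_p has shape (d_p, N)."""
    return np.einsum("in,jn,kn->ijk", X1, X2, X3)


def rank1_als(C, n_iter=50, seed=0):
    """Alternating updates for a rank-1 approximation rho * (u1 o u2 o u3) of C."""
    rng = np.random.default_rng(seed)
    u1, u2, u3 = (rng.standard_normal(d) for d in C.shape)
    u2 /= np.linalg.norm(u2)
    u3 /= np.linalg.norm(u3)
    for _ in range(n_iter):
        # With two factors fixed, the least-squares optimum for the third is a
        # tensor-vector contraction; each factor is then rescaled to unit length.
        u1 = np.einsum("ijk,j,k->i", C, u2, u3)
        u1 /= np.linalg.norm(u1)
        u2 = np.einsum("ijk,i,k->j", C, u1, u3)
        u2 /= np.linalg.norm(u2)
        u3 = np.einsum("ijk,i,j->k", C, u1, u2)
        rho = np.linalg.norm(u3)  # scale of the rank-1 term
        u3 /= rho
    return rho, (u1, u2, u3)


rng = np.random.default_rng(1)
X1, X2, X3 = (rng.standard_normal((d, 200)) for d in (4, 5, 6))
C = covariance_tensor(X1, X2, X3)
rho, (u1, u2, u3) = rank1_als(C)
print(C.shape, rho)
```

For r > 1 factors, ALS updates all r columns of each factor matrix jointly, which is why the resulting components are not obtained greedily one after another.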

5.1.2 Advertisement Classification 

This set of experiments is conducted on the Ads (internet advertisements)^ dataset from the well-known UCI 
Machine Learning Repository. The task is to predict whether or not a given hyperlink (associated with an 
image) is an advertisement. There are 3,279 instances in this dataset. We randomly choose 100 instances as 
labeled training samples, and all the instances except those for validation are utilized as unlabeled samples 
to find the common subspace. The performance is evaluated in a transductive setting on the unlabeled 
samples. 

We use the features as described in Kushmerick (1999), and omit the attributes that have missing values, 
such as the height (and width) of the image. The remaining attributes are represented by binary (1/0) 
features which indicate the presence/absence of corresponding terms. For CCA-LS and TCCA, we divide 
all these features into three views as follows: 

• View-1: features based on the terms in the image's URL, caption, and alt text. 588 dimensions; 

• View-2: features based on the terms in the URL of the current site. 495 dimensions; 

• View-3: features based on the terms in the anchor URL. 472 dimensions. 

Fig. 4 shows the classification accuracy of the compared methods (in relation to the dimension r), and 
the accuracies at their best dimensions are summarized in Table 2. In contrast to the observations of the last 
set of experiments, we can see that: 1) the accuracies of the concatenation strategy (CAT) and the best single 
view (BSF) are almost the same. The performance of CAT is relatively worse since the feature dimension 
in this set of experiments is high (1,555 dimensions), and over-fitting occurs given the limited number of 
labeled samples; 2) the performance of DSE and SSMVD first increases and then decreases sharply as the 
dimension r increases, while the CCA-based methods are much more stable; 3) the improvement 
of TCCA over the other CCA-based methods is not as great as in the last set of experiments. 
This is because more samples are needed to approximate the true underlying high order correlation than 
the traditional pairwise correlation, since there are more variables to be estimated in the high order 
statistics. Far fewer unlabeled instances are used in this set of experiments, and thus the high order 
correlation information is not well explored. CCA-LS is only comparable to CCA for the same reason. 

^http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements 

Table 2: Classification accuracies (%) of the different methods at their best dimensions on the Ads dataset. 

Methods       #labeled = 100
BSF           91.10±1.65
CAT           91.08±1.74
CCA (BST)     92.88±1.11
CCA (AVG)     93.84±0.85
CCA-LS        93.17±1.10
DSE           93.01±0.96
SSMVD         92.99±0.91
TCCA          94.59±0.27



Figure 5: Annotation accuracy vs. dimension of the common subspace on the NUS-WIDE mammal subset. 
(Left: 4 labeled instances for each mammal concept; Middle: 6 labeled instances; Right: 8 labeled instances.) 


5.1.3 Web Image Annotation 

We further verify the effectiveness of the proposed algorithm on a natural image dataset NUS-WIDE Chua 
et al (2009). This dataset contains 269,648 images, and our experiments are conducted on a subset that consists 
of 11,189 images belonging to 10 mammal concepts: bear, cat, cow, dog, elk, fox, horse, tiger, whale, and 
zebra. We randomly split the images into a training set of 5,597 images and a test set of 5,592 images. 
Distinguishing between these concepts is very challenging, since many of them are similar to each other, e.g., 
cat and tiger. We randomly choose {4, 6, 8} labeled instances for each concept in the training set, and all 
the training instances are utilized as unlabeled samples to find the common subspace. 

In this dataset, we choose three types of visual features, namely 500-D bag of visual words based on SIFT 
Lowe (2004) descriptors, 144-D color auto-correlogram, and 128-D wavelet texture, to represent each image 
Chua et al (2009). 

The annotation performance of the compared methods is shown in Fig. 5 and Table 3. It can be seen 
from the results that: 1) in general, performance improves with an increased number of labeled instances; 
2) CCA-LS is comparable to CCA (BST) and CCA (AVG), while the best performance (peak of the curve) 
of CCA-LS is usually higher; 3) the performance of DSE is poor when r is large, while SSMVD is much 
more stable and can sometimes be superior to CCA (AVG) and CCA-LS; 4) the accuracies of CCA (AVG) and 
CCA-LS first increase and then decrease as the dimension r increases, while the results of 
the proposed TCCA are satisfactory even when r is large; 5) the accuracy of TCCA is significantly better 
than that of all the other methods under most dimensionalities. 

Table 3: Annotation accuracies (%) of the different methods at their best dimensions on the NUS-WIDE 
mammal dataset. 

Methods       #labeled = 4    #labeled = 6    #labeled = 8
BSF           17.42±1.37      18.37±1.21      19.96±1.19
CAT           19.01±1.86      19.07±2.23      20.70±1.44
CCA (BST)     20.77±1.52      21.51±2.38      22.61±1.76
CCA (AVG)     21.21±1.47      21.57±2.04      22.61±1.21
CCA-LS        20.90±1.84      22.31±2.53      23.50±2.48
DSE           20.02±1.23      21.59±1.13      22.67±0.74
SSMVD         21.34±2.08      23.32±1.08      23.79±1.30
TCCA          22.40±1.96      23.86±1.41      24.11±0.32


5.2 Evaluation of the Non-linear Extension 

We evaluate the non-linear extension of the proposed TCCA in the web image annotation task. As discussed 
in Section 4.5, the non-linear extension is able to handle the small sample size problem, where the feature 
dimensions can be very high and possibly infinite. We thus randomly choose a small set of 500 samples from 
the mammal subset. To perform the non-linear classification, we construct a kernel for each kind of feature. 
The kernel is defined by 

$$k(x_i, x_j) = \exp\left(-\lambda^{-1} d(x_i, x_j)\right),$$

where $d(x_i, x_j)$ denotes the distance between $x_i$ and $x_j$, and $\lambda = \max_{i,j} d(x_i, x_j)$. We choose the $\chi^2$ distance 
for the visual word histogram. For other features, the L2 distance is utilized. Specifically, we compare the 
following methods: 

• BSK: using the single view kernel that achieves the best performance in the kNN-based classification. 

• AVG: averaging the normalized kernels of all the views, and then performing kNN-based classification. 

• KCCA Hardoon et al (2004): using the KCCA formulation presented in Hardoon et al (2004) to 
find a common representation of two different views. The regularization parameter is optimized over 
the set {10^i | i = −7, ..., 2}. The setup of KCCA (BST) and KCCA (AVG) is similar to that of CCA 
(BST) and CCA (AVG) in the experiments of the linear version. 

• KTCCA: the non-linear extension of the proposed tensor CCA. The regularization parameter ε is 
optimized in the same way as in KCCA. 
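
The kernel construction and the AVG baseline described above can be sketched as follows. The exponential kernel scaled by the maximum pairwise distance follows the definition given earlier; the particular chi-square distance formula used for histograms, the random toy features, and all function names are assumptions made for illustration only.

```python
import numpy as np


def l2_distances(X):
    """Pairwise Euclidean distances between the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off


def chi2_distances(H, eps=1e-12):
    """Pairwise chi-square distances between histogram rows (one common definition)."""
    diff = H[:, None, :] - H[None, :, :]
    total = H[:, None, :] + H[None, :, :]
    return 0.5 * np.sum(diff**2 / (total + eps), axis=2)


def kernel_from_distances(D):
    """k(x_i, x_j) = exp(-d(x_i, x_j) / lam), with lam the largest pairwise distance."""
    return np.exp(-D / D.max())


def average_kernel(kernels):
    """AVG baseline: mean of the per-view kernels (each already has a unit diagonal)."""
    return sum(kernels) / len(kernels)


rng = np.random.default_rng(0)
hist = rng.random((6, 10))
hist /= hist.sum(axis=1, keepdims=True)  # toy visual-word histograms
other = rng.standard_normal((6, 8))      # toy features compared with the L2 distance
K_avg = average_kernel([
    kernel_from_distances(chi2_distances(hist)),
    kernel_from_distances(l2_distances(other)),
])
print(K_avg.shape)
```

Because every distance is divided by the largest one, all kernel values lie between exp(−1) and 1, which keeps the kernels of different views on a comparable scale before they are averaged.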

The experimental results are shown in Fig. 6 and Table 4. Compared with the results in Fig. 5, we 
can see that: 1) although a small number of unlabeled samples is utilized, the performance is better since 
the separability is improved by the non-linear projection, which is implemented via the kernel trick 
Shawe-Taylor and Cristianini (2004); 2) the simple AVG view combination strategy outperforms the best single 
view kernel (BSK) significantly, and is comparable to KCCA (BST); 3) KCCA (AVG) is slightly better than 
KCCA (BST), and the proposed KTCCA achieves the best performance under most dimensionalities. 

Figure 6: Annotation accuracy (of the non-linear methods) vs. dimension of the common subspace on the 
NUS-WIDE mammal subset, where a small set of 500 samples is utilized. (Left: 4 labeled instances for each 
mammal concept; Middle: 6 labeled instances; Right: 8 labeled instances.) 

Table 4: Annotation accuracies (%) of the different non-linear methods at their best dimensions on the 
NUS-WIDE mammal dataset. 

Methods        #labeled = 4    #labeled = 6    #labeled = 8
BSK            17.96±1.29      19.17±2.01      20.04±1.66
AVG            20.49±1.65      21.73±2.74      22.86±1.87
KCCA (BST)     21.51±2.44      22.58±1.91      23.78±1.57
KCCA (AVG)     21.85±1.38      23.13±1.77      24.28±1.04
KTCCA          24.51±0.78      25.18±0.58      25.74±0.90

5.3 Empirical analysis of the computational complexity 

In this subsection, we empirically analyze the computational complexity of the different methods. The 
experiments are conducted in MATLAB R2012b on a 2 × 3.33 GHz Intel Xeon (6 cores) computer with 
48GB of 1333MHz ECC DDR3 RAM. The results (time cost and memory cost) on the different 
datasets are shown in Figs. 7-10. From the results, we observe that: 1) the costs of the proposed TCCA are 
generally higher than those of the other CCA-based methods. This is because the decomposition is performed on 
a large d_1 × d_2 × ... × d_m covariance tensor, instead of one or multiple d_p × d_q covariance matrices, where 
p, q = 1, ..., m are the view indices. The tensor decomposition method we adopt in this paper is the ALS 
algorithm Kroonenberg and De Leeuw (1980); Comon et al (2009), which can achieve satisfactory accuracy 
but is not efficient; 2) TCCA is much more efficient than DSE or SSMVD when the feature dimensions are not 
very high and the number of instances is large (see Fig. 7 for example). This demonstrates the superiority 
of TCCA over the existing unsupervised multi-view dimension reduction methods on large 
sample size problems. 
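
The gap between the tensor and the pairwise matrices can be quantified from the view dimensions reported in Section 5.1. The sketch below counts entries and estimates dense storage; the assumption of dense storage with 8-byte floating-point entries is made here for illustration and is not stated in the experiments.

```python
from itertools import combinations
from math import prod

# View dimensions of the three datasets, as given in Section 5.1.
VIEW_DIMS = {
    "SecStr": (105, 105, 105),
    "Ads": (588, 495, 472),
    "NUS-WIDE": (500, 144, 128),
}


def tensor_entries(dims):
    """Number of entries of the d_1 x d_2 x ... x d_m covariance tensor."""
    return prod(dims)


def pairwise_entries(dims):
    """Total number of entries of all d_p x d_q covariance matrices with p < q."""
    return sum(a * b for a, b in combinations(dims, 2))


def megabytes(n_entries, bytes_per_entry=8):
    """Dense storage in MB, assuming 8-byte entries (an assumption)."""
    return n_entries * bytes_per_entry / 1e6


for name, dims in VIEW_DIMS.items():
    t, p = tensor_entries(dims), pairwise_entries(dims)
    print(f"{name}: tensor {t} entries ({megabytes(t):.1f} MB), pairwise {p} entries")
```

On the Ads dataset, for example, the tensor has about 137 million entries against roughly 0.8 million for all pairwise matrices combined, which is consistent with both the higher cost of TCCA and its need for more samples to estimate the high order statistics.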

6 Conclusion 

Standard CCA cannot deal with multi-view data, and its typical multi-view extensions ignore the high order 
statistics (correlation information) among all feature views. To resolve this problem, we have presented 
tensor CCA (TCCA) to discover such statistics by analyzing the covariance tensor of all views. 

From the experimental validation on a variety of application tasks, we conclude that: 1) finding a 
common subspace for all views using the CCA-based strategy is often better than simply concatenating 
all the features, especially when the feature dimension is high; 2) examining more statistics, which may 
require more unlabeled data to be utilized, often leads to better performance; 3) by exploring the high order 
statistics, the proposed TCCA outperforms the other methods, especially when the dimension of the common 
subspace is high. 

Compared with CCA and its traditional multi-view extensions, the main disadvantage of the proposed 
TCCA is its high computational cost. Most of the TCCA cost lies in the tensor decomposition, which is 
not the focus of this paper. In the future, we will develop efficient tensor decomposition methods that could 
speed up TCCA, or introduce parallel computing techniques that utilize GPUs to accelerate the ALS 
tensor decomposition. 

Figure 7: Computational complexity vs. dimension of the common subspace on the SecStr dataset. (Top: 
time cost in seconds; Bottom: memory cost in Megabits.) 

Figure 8: Computational complexity vs. dimension of the common subspace on the Ads dataset. 100 labeled 
training samples are utilized. (Top: time cost in seconds; Bottom: memory cost in Megabits.) 

Figure 9: Computational complexity vs. dimension of the common subspace on the NUS-WIDE mammal 
subset. 6 labeled samples for each mammal concept are utilized. (Top: time cost in seconds; Bottom: 
memory cost in Megabits.) 

Figure 10: Computational complexity (of the non-linear methods) vs. dimension of the common subspace on 
the NUS-WIDE mammal subset, where a small set of 500 instances and 6 labeled samples for each mammal 
concept are utilized. (Top: time cost in seconds; Bottom: memory cost in Megabits.) 

A Proof of Theorem 1 

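Proof. According to the definition of the element-wise product, we have 

$$\rho = (z_1 \odot z_2 \odot \cdots \odot z_m)^T e = \sum_{n=1}^{N} z_1(n) z_2(n) \cdots z_m(n) = \sum_{n=1}^{N} \prod_{p=1}^{m} z_p(n) = \sum_{n=1}^{N} \prod_{p=1}^{m} \left( \sum_{j_p=1}^{d_p} x_{pn}(j_p) h_p(j_p) \right), \tag{A.1}$$

where $z_p(n)$ denotes the $n$'th entry of the vector $z_p$, and the same notation is used for $x_{pn}$ and $h_p$. Additionally, 

$$\mathcal{C}_{12 \ldots m}(j_1, j_2, \ldots, j_m) = \sum_{n=1}^{N} x_{1n}(j_1) x_{2n}(j_2) \cdots x_{mn}(j_m) = \sum_{n=1}^{N} \prod_{p=1}^{m} x_{pn}(j_p).$$

According to the definition of the p-mode product of a tensor and matrix, we have 

$$\begin{aligned}
(\mathcal{C} \times_p h_p^T)(j_1, \ldots, j_{p-1}, 1, j_{p+1}, \ldots, j_m)
&= \sum_{j_p=1}^{d_p} \mathcal{C}(j_1, j_2, \ldots, j_m) h_p(j_p) \\
&= \sum_{j_p=1}^{d_p} \left( \sum_{n=1}^{N} \prod_{q=1}^{m} x_{qn}(j_q) \right) h_p(j_p) \\
&= \sum_{n=1}^{N} \sum_{j_p=1}^{d_p} \left( \prod_{q=1}^{m} x_{qn}(j_q) \right) h_p(j_p).
\end{aligned}$$

Therefore, 

$$(\mathcal{C} \times_1 h_1^T \times_2 h_2^T \cdots \times_m h_m^T)(1, \ldots, 1) = \sum_{n=1}^{N} \prod_{p=1}^{m} \left( \sum_{j_p=1}^{d_p} x_{pn}(j_p) h_p(j_p) \right).$$

This completes the proof. 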
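
The identity proved above can also be checked numerically. The sketch below does so for three views with random data; the dimensions, the random inputs, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims = 50, (4, 5, 6)
X = [rng.standard_normal((d, N)) for d in dims]  # column n of X[p] is x_pn
h = [rng.standard_normal(d) for d in dims]       # one projection vector per view

# Left-hand side of (A.1): sum over samples of the element-wise product of
# the projections z_p = X_p^T h_p.
z = [Xp.T @ hp for Xp, hp in zip(X, h)]
rho_direct = np.sum(z[0] * z[1] * z[2])

# Right-hand side: build the tensor C = sum_n x_1n o x_2n o x_3n and contract
# every mode with the corresponding h_p.
C = np.einsum("in,jn,kn->ijk", *X)
rho_tensor = np.einsum("ijk,i,j,k->", C, *h)

print(np.isclose(rho_direct, rho_tensor))
```

Both quantities agree up to floating-point round-off, as the proof predicts.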

B Proof of Theorem 3 

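Proof. Let $\mathcal{T} = \mathcal{C}_{12 \ldots m} \times_1 \phi^T(X_1) \times_2 \phi^T(X_2) \cdots \times_m \phi^T(X_m)$ and $\mathcal{G} = \sum_{n=1}^{N} k_{1n} \circ k_{2n} \circ \cdots \circ k_{mn}$; then 
according to the definition of the outer product, the $(j_1, j_2, \ldots, j_m)$'th entry of $\mathcal{G}$ is 

$$\mathcal{G}(j_1, j_2, \ldots, j_m) = \sum_{n=1}^{N} k_{1n}(j_1) k_{2n}(j_2) \cdots k_{mn}(j_m), \tag{B.1}$$

where $k_{pn}(j_p)$ is the $j_p$'th element of the vector $k_{pn}$, $p = 1, \ldots, m$. Additionally, the $(i_1, i_2, \ldots, i_m)$'th entry 
of $\mathcal{C}$ is 

$$\mathcal{C}(i_1, i_2, \ldots, i_m) = \sum_{n=1}^{N} \phi_{1n}(i_1) \phi_{2n}(i_2) \cdots \phi_{mn}(i_m), \tag{B.2}$$

where $\phi_{pn}(i_p)$ is the $i_p$'th element of the vector $\phi(x_{pn})$, $p = 1, \ldots, m$. According to the definition of the 
tensor-matrix product, we have 

$$\begin{aligned}
(\mathcal{C} \times_p \phi^T(X_p))(i_1, \ldots, i_{p-1}, j_p, i_{p+1}, \ldots, i_m)
&= \sum_{i_p=1}^{D_p} \mathcal{C}(i_1, i_2, \ldots, i_m) \phi_p(j_p, i_p) \\
&= \sum_{i_p=1}^{D_p} \sum_{n=1}^{N} \phi_{1n}(i_1) \phi_{2n}(i_2) \cdots \phi_{mn}(i_m) \phi_p(j_p, i_p) \\
&= \sum_{n=1}^{N} \phi_{1n}(i_1) \cdots \phi_{p-1,n}(i_{p-1}) \phi_{p+1,n}(i_{p+1}) \cdots \phi_{mn}(i_m) \sum_{i_p=1}^{D_p} \phi_{pn}(i_p) \phi_p(j_p, i_p) \\
&= \sum_{n=1}^{N} \phi_{1n}(i_1) \cdots \phi_{p-1,n}(i_{p-1}) \phi_{p+1,n}(i_{p+1}) \cdots \phi_{mn}(i_m) k_{pn}(j_p).
\end{aligned}$$

Then the $(j_1, j_2, \ldots, j_m)$'th entry of $\mathcal{T}$ is 

$$\begin{aligned}
\mathcal{T}(j_1, j_2, \ldots, j_m) &= (\mathcal{C} \times_1 \phi^T(X_1) \times_2 \phi^T(X_2) \cdots \times_m \phi^T(X_m))(j_1, j_2, \ldots, j_m) \\
&= \sum_{n=1}^{N} k_{1n}(j_1) k_{2n}(j_2) \cdots k_{mn}(j_m).
\end{aligned} \tag{B.3}$$

By comparing (B.1) and (B.3), we complete the proof. □ 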
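
As with Theorem 1, the result can be verified numerically when the feature map is finite-dimensional and explicit. The sketch below uses a hypothetical map x -> [x, x²] (element-wise square) so that the feature-space tensor can actually be formed; with a general kernel only the kernel-side tensor is computable, which is the point of the theorem.

```python
import numpy as np


def phi(X):
    """Hypothetical explicit feature map applied to each column: x -> [x, x**2]."""
    return np.vstack([X, X**2])


rng = np.random.default_rng(1)
N, dims = 8, (3, 4, 2)
X = [rng.standard_normal((d, N)) for d in dims]
Phi = [phi(Xp) for Xp in X]   # Phi[p] has shape (D_p, N); column n is phi(x_pn)
K = [P.T @ P for P in Phi]    # K[p] has shape (N, N); column n is k_pn

# Feature-space tensor C, then one mode product with phi^T(X_p) per view.
C = np.einsum("an,bn,cn->abc", *Phi)
T = np.einsum("abc,ia,jb,kc->ijk", C, Phi[0].T, Phi[1].T, Phi[2].T)

# Kernel-side tensor: sum over samples of the outer products of the kernel columns.
G = np.einsum("in,jn,kn->ijk", *K)

print(np.allclose(T, G))
```

The two N × N × N tensors coincide, so the decomposition can be carried out using kernel evaluations alone.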